Language of Data

# Loading Data for Exercise.
library(openintro)
library(dplyr)
data(email50)

Types of Variables

  • Variable types help us determine:
    • Summary statistics to calculate.
    • Types of visualization to make.
    • Statistical methods that are appropriate for answering questions.
  • Numerical (quantitative) Variables: numerical values.
    • Continuous: infinite number of values within a given range, often measured.
    • Discrete: specific set of numerical values that can be counted or enumerated, often counted.
  • Categorical (qualitative) Variables: limited number of distinct categories.
    • Ordinal: categories with an inherent ordering (for example: low, medium, high).
glimpse(email50)
## Observations: 50
## Variables: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ time         <dttm> 2012-01-04 21:19:16, 2012-02-17 04:10:06, 2012-0...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, ...
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, y...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, ...
## $ number       <fct> small, big, none, small, small, small, small, sma...
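
As a rough illustration of how these variable types map onto R classes, the sketch below builds a small made-up data frame and inspects each column. The `toy` data and its column names are hypothetical and are not part of email50.

```r
# Hypothetical toy data; none of these columns come from email50
toy <- data.frame(
  height_cm  = c(172.4, 165.0, 180.2),                  # continuous numerical
  n_siblings = c(0L, 2L, 1L),                           # discrete numerical
  eye_color  = factor(c("brown", "blue", "brown")),     # categorical, no ordering
  size       = factor(c("small", "big", "none"),
                      levels = c("none", "small", "big"),
                      ordered = TRUE)                   # ordinal categorical
)

# First class of each column
col_classes <- sapply(toy, function(col) class(col)[1])
print(col_classes)

# Ordered factors support comparisons that respect the level order
at_least_small <- toy$size >= "small"
print(at_least_small)
```

Storing an ordinal variable as an ordered factor keeps the ordering explicit, so comparisons and sorted summaries follow the intended order rather than alphabetical order.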

Categorical Data in R

  • Often stored as factors in R.
    • Important use: Statistical modeling.
    • Sometimes might be undesirable, sometimes it’s essential.
  • Common in subgroup analysis.
    • Only interested in a subset of the data.
    • Filter for specific levels of categorical variable.

Filter Data

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == 'big')

Drop Unused Levels

# Table of the number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7
# Drop levels
email50_big$number <- droplevels(email50_big$number)

# Another table of the number variable
table(email50_big$number)
## 
## big 
##   7
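
The same behavior can be reproduced without email50. The sketch below uses a hypothetical factor that stands in for the number variable and shows that subsetting keeps every original level until `droplevels()` is called.

```r
# Hypothetical stand-in for the number variable in email50
number <- factor(c("small", "big", "none", "small", "big"),
                 levels = c("none", "small", "big"))

# Subsetting keeps all original levels, even those with zero counts
big_only <- number[number == "big"]
print(levels(big_only))

# droplevels() removes the levels that no longer occur
big_only_dropped <- droplevels(big_only)
print(levels(big_only_dropped))
```

Unused levels matter because tables, plots, and models built on the subset would otherwise still reserve a slot for categories that have no observations.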


Discretize Variables

  • ifelse(test, yes, no): returns the value of yes where the logical test is TRUE and the value of no where it is FALSE.
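
A minimal sketch of how `ifelse()` works element by element, using a made-up vector and cutoff (both are assumptions for illustration only):

```r
# Hypothetical values and cutoff
x <- c(3, 8, 5, 10)
cutoff <- 5

# The test is evaluated for every element of x
label <- ifelse(x < cutoff, "below cutoff", "at or above cutoff")
print(label)
```

Because the test is vectorized, the result has the same length as `x`, which is what makes `ifelse()` convenient inside `mutate()`.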

Using ifelse()

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, 'below median', 'at or above median'))
  
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
## # A tibble: 2 x 2
##   num_char_cat           n
##   <chr>              <int>
## 1 at or above median    25
## 2 below median          25

The median marks the 50th percentile, or midpoint, of a distribution,
so half of the emails should fall in one category and the other half in the other.

Combining Levels of a Different Factor

# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(number_yn = case_when(
    number == 'none' ~ "No", # if number is "none", make number_yn "No"
    number != 'none' ~ "Yes"  # if number is not "none", make number_yn "Yes"
    )
  )
  
# Visualize number_yn
library(ggplot2)
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()

Visualization with ggplot2

Reference: Data Visualization with ggplot2 (I) (II) (III)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

ggplot2 automatically creates a helpful legend for the plot,
telling you which color corresponds to each level of the spam variable.


Study Types and Cautionary Tales

Study Types

  • Observational Study
    • Collect data in a way that does not directly interfere with how the data arise.
    • Only association (correlation) can be inferred, not causation.
  • Experiment
    • Randomly assign subjects to various treatments.
    • Causation can be inferred.


Example1: Screens at Bedtime & Attention Span

Example2: Which is faster? Arial or Helvetica

A study is designed to evaluate whether people read text faster in Arial or Helvetica font. A group of volunteers who agreed to be a part of the study are randomly assigned to two groups: one where they read some text in Arial, and another where they read the same text in Helvetica. At the end, average reading speeds from the two groups are compared. 

Q: What type of study is this?

A: Experiment.

Example3: gapminder

# Glimpse data
library(gapminder)
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Since there is no way to randomly assign countries to attributes,
this is an observational study.


Random Sampling & Random Assignment

  • Random Sampling: Generalizable
    • Occurs at the selection of subjects from the population.
    • Helps generalizability of results.
  • Random Assignment: Causal
    • Occurs at the assignment of subjects to various treatments.
    • Helps infer causation from results.
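
The two steps can be sketched in base R as follows. The population size, sample size, treatment labels, and seed are all hypothetical; the point is only that sampling decides who enters the study, while assignment decides who receives which treatment.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical population of 1000 subject IDs
population <- 1:1000

# Random sampling: choose which subjects enter the study
study_ids <- sample(population, size = 20)

# Random assignment: shuffle a balanced vector of treatment labels
treatment <- sample(rep(c("A", "B"), each = 10))

study <- data.frame(id = study_ids, treatment = treatment)
print(table(study$treatment))
```

A study can use either step without the other, which is why the two properties (generalizability and causal inference) have to be checked separately.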


One of the early studies linking smoking and lung cancer compared patients who were already hospitalized with lung cancer to similar patients without lung cancer (hospitalized for other reasons), and recorded whether each patient smoked. Then, the proportions of smokers among patients with and without lung cancer were compared.

Q: Does this study employ random sampling and/or random assignment?

A: Neither random sampling nor random assignment.
Random assignment is not employed because the conditions are not imposed on the patients by the people conducting the study.    
Random sampling is not employed because the study records the patients who are already hospitalized,    
so it wouldn't be appropriate to apply the findings back to the population as a whole.    


Simpson’s paradox

Here we can see that the trend between x1 and y (gray dashed line) is reversed when x2 (the grouping variable) is considered.
If we don’t consider x2, the relationship between x1 and y is positive.
If we do consider x2, we see that within each group the relationship between x1 and y is actually negative.
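
A small constructed data set can reproduce this pattern. The sketch below is not the data behind the plot described above; the values are hypothetical and chosen so that y falls as x1 rises within each group, while groups with larger x1 also have larger y.

```r
# Hypothetical data: three groups (x2), five observations each
x2 <- rep(1:3, each = 5)
offset <- rep(0:4, times = 3)
x1 <- 10 * x2 + offset
y  <- 20 * x2 - offset   # within a group, y falls as x1 rises

d <- data.frame(x1 = x1, y = y, x2 = factor(x2))

# Ignoring x2: overall slope of y on x1
overall_slope <- coef(lm(y ~ x1, data = d))[["x1"]]
print(overall_slope)

# Considering x2: slope of y on x1 within each group
within_slopes <- sapply(split(d, d$x2),
                        function(g) coef(lm(y ~ x1, data = g))[["x1"]])
print(within_slopes)
```

The overall slope comes out positive while every within-group slope is negative, which is the reversal that defines Simpson's paradox.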


Example: ucb_admit

Overall

# Count number of male and female applicants admitted
ucb_admission_counts <- ucb_admit %>%
  count(Gender, Admit)
ucb_admission_counts
## # A tibble: 4 x 3
##   Gender Admit        n
##   <fct>  <fct>    <int>
## 1 Male   Admitted  1198
## 2 Male   Rejected  1493
## 3 Female Admitted   557
## 4 Female Rejected  1278
# Proportion of applicants admitted overall, by gender
ucb_admission_counts %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  filter(Admit == "Admitted")
## # A tibble: 2 x 4
## # Groups:   Gender [2]
##   Gender Admit        n  prop
##   <fct>  <fct>    <int> <dbl>
## 1 Male   Admitted  1198 0.445
## 2 Female Admitted   557 0.304

It looks like 44.5% of males were admitted versus only 30.4% of females, but there’s more to the story.

Within most Departments

# Proportion of females admitted for each department
ucb_admission_counts <- ucb_admit %>%
  # Counts by department, then gender, then admission status
  count(Dept, Gender, Admit)
ucb_admission_counts
## # A tibble: 24 x 4
##    Dept  Gender Admit        n
##    <fct> <fct>  <fct>    <int>
##  1 A     Male   Admitted   512
##  2 A     Male   Rejected   313
##  3 A     Female Admitted    89
##  4 A     Female Rejected    19
##  5 B     Male   Admitted   353
##  6 B     Male   Rejected   207
##  7 B     Female Admitted    17
##  8 B     Female Rejected     8
##  9 C     Male   Admitted   120
## 10 C     Male   Rejected   205
## # ... with 14 more rows
ucb_admission_counts %>%
  # Group by department, then gender
  group_by(Dept, Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for female and admitted
  filter(Gender == "Female", Admit == "Admitted")
## # A tibble: 6 x 5
## # Groups:   Dept, Gender [6]
##   Dept  Gender Admit        n   prop
##   <fct> <fct>  <fct>    <int>  <dbl>
## 1 A     Female Admitted    89 0.824 
## 2 B     Female Admitted    17 0.68  
## 3 C     Female Admitted   202 0.341 
## 4 D     Female Admitted   131 0.349 
## 5 E     Female Admitted    94 0.239 
## 6 F     Female Admitted    24 0.0704

We can see that the proportion of females admitted varies widely between departments.
Within most departments, female applicants are more likely to be admitted.


Conclusion of the Example

  • Overall: Males more likely to be admitted.
  • Within Most Departments: Females more likely to be admitted.
  • When controlling for department, the relationship between gender & admission is reversed.
  • Potential Reason:
    • Women tended to apply to competitive departments with low admission rates.
    • Men tended to apply to less competitive departments with high admission rates.
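
The reversal can be checked side by side for both genders. The sketch below assumes that R's built-in `UCBAdmissions` table (from the datasets package) holds the same counts as `ucb_admit`; its overall totals of admitted applicants match the 1198 and 557 shown earlier, but the equivalence is an assumption rather than something these notes state.

```r
# Built-in 3-way table: Admit x Gender x Dept
admitted <- UCBAdmissions["Admitted", , ]        # Gender x Dept counts
applied  <- apply(UCBAdmissions, c(2, 3), sum)   # Gender x Dept totals

# Overall admission rate by gender (ignores department)
overall_rate <- rowSums(admitted) / rowSums(applied)
print(round(overall_rate, 3))

# Admission rate by gender within each department
rate_by_dept <- admitted / applied
print(round(rate_by_dept, 3))

# Departments where the female rate exceeds the male rate
female_higher <- rate_by_dept["Female", ] > rate_by_dept["Male", ]
print(names(which(female_higher)))
```

Under that assumption, males have the higher overall rate, yet the female rate is higher in four of the six departments.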


Sampling Strategies & Experimental Design

Sampling Strategies

Example: Randomly drawing names from a hat.
  • Simple Random Sample
    • Randomly select cases from the population.
    • Each case is equally likely to be selected.


Example: If we wanted to make sure that people of low, medium, and high socioeconomic status are equally represented in a study, we would first divide the population into those three groups and then sample from within each group.
  • Stratified Sample
    • We first divide the population into homogeneous groups, called strata.
    • Then we randomly sample from within each stratum.


  • Cluster Sample
    • Divide the population into clusters.
    • Randomly sample a few clusters.
    • Sample all observations within these clusters.
    • The clusters, unlike strata in stratified sampling, are heterogeneous within themselves.
    • Each cluster is similar to the others, such that we can get away with sampling from just a few of the clusters.


Cluster and multistage sampling are often used for economic reasons.
Example: one might divide a city into geographic regions that are on average similar to each other and then sample randomly from a few randomly picked regions in order to avoid traveling to all regions.
  • Multistage Sample
    • Multistage sampling adds another step to cluster sampling.
    • Divide the population into clusters.
    • Randomly sample a few clusters.
    • Randomly sample observations from within those clusters.
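
Cluster and multistage sampling can be sketched in base R as below. The population, region names, cluster count, and sample sizes are all hypothetical.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical population: 10 regions (clusters) with 20 residents each
pop <- data.frame(
  region = rep(paste0("region_", 1:10), each = 20),
  resident_id = 1:200
)

# Cluster sample: randomly pick 3 regions, keep everyone in them
picked_regions <- sample(unique(pop$region), size = 3)
cluster_sample <- pop[pop$region %in% picked_regions, ]

# Multistage sample: within each picked region, randomly keep 5 residents
multistage_sample <- do.call(rbind, lapply(picked_regions, function(r) {
  in_region <- pop[pop$region == r, ]
  in_region[sample(nrow(in_region), size = 5), ]
}))

print(table(cluster_sample$region))
print(table(multistage_sample$region))
```

The only difference between the two is the last step: the cluster sample keeps every observation in the chosen clusters, while the multistage sample draws a further random sample inside each one.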


Sampling in R

Simple Random Sample

us_regions <- get(load('D:/Downloads/us_regions.RData'))
# Simple random sample: states_srs
states_srs <- us_regions %>%
  sample_n(size = 8)

# Count states by region
states_srs %>%
  count(region)
## # A tibble: 4 x 2
##   region        n
##   <fct>     <int>
## 1 Midwest       4
## 2 Northeast     1
## 3 South         2
## 4 West          1


Stratified Sample

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(size = 2)

# Count states by region
states_str %>%
  group_by(region) %>%
  count(region)
## # A tibble: 4 x 2
## # Groups:   region [4]
##   region        n
##   <fct>     <int>
## 1 Midwest       2
## 2 Northeast     2
## 3 South         2
## 4 West          2


Principles of Experimental Design

  • Control: Compare the treatment of interest to a control group.
  • Randomize: Randomly assign subjects to treatments.
  • Replicate: Collect a sufficiently large sample within a study, or replicate the entire study.
  • Block: Account for the potential effect of known or suspected confounding variables.

Identifying Components of a Study

A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).

Experimental Design Terminology

'Explanatory' variables are conditions you can impose on the experimental units, while 'blocking' variables are characteristics that the experimental units come with that you would like to control for.

Connect Blocking and Stratifying

In random sampling, we use 'stratifying' to control for a variable. In random assignment, we use 'blocking' to achieve the same goal.
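
Blocking during random assignment can be sketched as follows: subjects are split by the blocking variable, and treatments are randomized separately within each block. The subject table, treatment labels, and seed are hypothetical.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical subjects: 8 females and 8 males
subjects <- data.frame(
  id = 1:16,
  gender = rep(c("female", "male"), each = 8)
)

treatments <- c("control", "treatment")

# Within each block (gender), shuffle a balanced vector of treatments
subjects$group <- NA_character_
for (block in unique(subjects$gender)) {
  in_block <- subjects$gender == block
  balanced <- rep(treatments, length.out = sum(in_block))
  subjects$group[in_block] <- sample(balanced)
}

print(table(subjects$gender, subjects$group))
```

Randomizing inside each block guarantees that both genders are equally represented under every condition, which parallels how the stratified sample above draws the same number of states from every region.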